Library Imports

from pyspark.sql import SparkSession
from pyspark.sql import types as T

from pyspark.sql import functions as F

from datetime import datetime
from decimal import Decimal

Template

spark = (
    SparkSession.builder
    .master("local")
    .appName("Section 2.7 - Equality Statements in Spark and Comparison with Nulls")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)

sc = spark.sparkContext

import os

data_path = "/data/pets.csv"
base_path = os.path.dirname(os.getcwd())
path = base_path + data_path
pets = spark.read.csv(path, header=True)
pets.toPandas()
   id  breed_id  nickname  birthday             age  color
0  1   1         King      2014-11-22 12:30:31  5    brown
1  2   3         Argus     2016-11-22 10:05:10  10   None
2  3   1         Chewie    2016-11-22 10:05:10  15   None
3  3   2         Maple     2018-11-22 10:05:10  17   white
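
A quick aside (not part of the original walkthrough): since the CSV is read without a schema and without inferSchema, every column is loaded as a string, so the comparisons below are string comparisons.

# All columns are strings here; e.g. age is the string '5', not the number 5.
pets.printSchema()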

Filtering Data

When you want to filter data using more than one expression, there are a couple of gotchas you will need to be careful of.

Case 1: Multiple Conditions

(
    pets
    .where(
        (F.col('breed_id') == 1) &
        (F.col('color') == 'brown') &
        (F.col('color').isin('brown'))
    )
    .toPandas()
)
   id  breed_id  nickname  birthday             age  color
0  1   1         King      2014-11-22 12:30:31  5    brown

What Happened?

When there is more than one condition, you will need to wrap each condition in () brackets and use Python's bitwise operators (&, |, ~) instead of the logical operators (and, or, not).

Why?

This is because Python does not allow the logical operators (and, or, not) to be overloaded; they rely on truth-value testing, which a Spark Column cannot meaningfully provide. The bitwise operators (&, |, ~) can be overloaded, so Spark uses those to build boolean column expressions instead.
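
To see the gotcha in action, here is a minimal sketch (not part of the original example) of what happens when you use the logical operator and instead of &:

try:
    # `and` truth-tests the left-hand Column, which PySpark refuses to allow
    pets.where((F.col('breed_id') == 1) and (F.col('color') == 'brown'))
except ValueError as e:
    # Prints something like: "Cannot convert column into bool: please use
    # '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame
    # boolean expressions."
    print(e)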

Case 2: Nested Conditions

(
    pets
    .where(
        (
            F.col('breed_id').isin([1, 2]) &
            F.col('breed_id').isNotNull()
        ) |
        (F.col('color') == 'white')
    )
    .toPandas()
)
   id  breed_id  nickname  birthday             age  color
0  1   1         King      2014-11-22 12:30:31  5    brown
1  3   1         Chewie    2016-11-22 10:05:10  15   None
2  3   2         Maple     2018-11-22 10:05:10  17   white

What Happened?

Similar to before, nested conditions will need to be wrapped in () brackets as well, so that they group the way you intend.
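
If the bracket-heavy syntax gets hard to read, note that where() also accepts a SQL expression string. Here is a minimal sketch (not part of the original example) expressing the same nested condition that way:

(
    pets
    .where("(breed_id IN (1, 2) AND breed_id IS NOT NULL) OR color = 'white'")
    .toPandas()
)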

Case 3: Equality Statements with Null Values (use isNotNull() and isNull())

(
    pets
    .withColumn('result', F.col('color') != 'white')
    .withColumn(
        'result_2', 
        (F.col('color') != 'white') &
        (F.col('color').isNotNull())
    )
    .toPandas()
)
   id  breed_id  nickname  birthday             age  color  result  result_2
0  1   1         King      2014-11-22 12:30:31  5    brown  True    True
1  2   3         Argus     2016-11-22 10:05:10  10   None   None    False
2  3   1         Chewie    2016-11-22 10:05:10  15   None   None    False
3  3   2         Maple     2018-11-22 10:05:10  17   white  False   False

What Happened?

If you do not come from a SQL background, this can be surprising: any comparison with a Null value evaluates to Null, unless you specifically use the Null comparisons. Notice how result is Null for the rows where color is Null, while result_2 handles those rows explicitly.

The two Null comparisons are isNotNull() and isNull().
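
Spark also provides a null-safe equality test, eqNullSafe() (available since Spark 2.3), which treats Null as an ordinary value and never returns Null. A minimal sketch; note that its semantics differ from result_2 above, because a Null color now counts as "not white" and comes back True:

(
    pets
    # eqNullSafe never returns Null, so the negation is a plain boolean
    .withColumn('result_3', ~F.col('color').eqNullSafe('white'))
    .toPandas()
)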

Summary

  • In Spark, when using a more involved conditional expression, you will need to wrap each condition in () brackets and use Python's bitwise operators instead of the logical operators.

  • Be explicit when you're performing conditional transformations on columns that can be Null.
